Data

In todays session we will continue to review the Myc ChIPseq we were working on in our last session.

This include Myc ChIP-seq for MEL and Ch12 celllines as well as their input controls.

Information and files for the Myc ChIPseq in MEL cell line can be found here

Information and files for the Myc ChIPseq in Ch12 cell line can be found here

Input control can be found for MEL cell line can be found here

Input control can be found for Ch12 cell line can be found here.

Quality Control - Quality metrics for ChIP-seq.

The ChIPQC package wraps some of the metrics into a Bioconductor package and takes care to measure these metrics under the appropriate condition.

To run a single sample we can use the ChIPQCsample() function, the relevant unfiltered BAM file and we are recommended to supply a blacklist as a BED file or GRanges and Genome name.

You can find a Blacklist for most genomes at Anshul Kundaje’s site or directly from the Encode websites

QCresult <- ChIPQCsample(reads="/pathTo/myChIPreads.bam",
                         genome="mm10",
                         blacklist = "/pathTo/mm10_Blacklist.bed")

Quality control

We can display our ChIPQCsample object which will show a basic summary of our ChIP-seq quality.

chipqc_MycMel_rep1
##                  ChIPQCsample
## Number of Mapped reads: 17434045
## Number of Mapped reads passing MapQ filter: 14027340
## Percentage Of Reads as Non-Duplicates (NRF): 100(0)
## Percentage Of Reads in Blacklisted Regions: 6
## SSD: 2.04563248560836
## Fragment Length Cross-Coverage: 0.00240586015883558
## Relative Cross-Coverage: 1.36747914104688
## Percentage Of Reads in GenomicFeature:
##                         ProportionOfCounts
## BlackList                       0.06945900
## LongPromoter20000to2000         0.25996440
## Promoters2000to500              0.04602540
## Promoters500                    0.04435780
## All5utrs                        0.02186480
## Alltranscripts                  0.56982008
## Allcds                          0.02862382
## Allintrons                      0.51979606
## All3utrs                        0.02149253
## Percentage Of Reads in Peaks: NA
## Number of Peaks: 0
## GRanges object with 0 ranges and 0 metadata columns:
##    seqnames    ranges strand
##       <Rle> <IRanges>  <Rle>
##   -------
##   seqinfo: no sequences

Quality control

All ChIPQC functions can work with a named list of ChIPQCsample objects to aggregate scores into table as well as plots.

Here we use the QCmetrics() function to ive an overview of quality metrics.

QCmetrics(myQC)
##                          Reads Map% Filt% Dup% ReadL FragL RelCC  SSD RiP%
## Sorted_Myc_Ch12_1.bam 10993821  100  21.9    0    35   186 1.070 3.82   NA
## Sorted_Myc_Ch12_2.bam 10060460  100  18.4    0    35   146 1.310 2.84   NA
## Sorted_Myc_MEL_1.bam  10111080  100  20.8    0    35   177 1.220 3.42   NA
## Sorted_Myc_MEL_2.bam  10686984  100  23.4    0    35   177 1.050 4.20   NA
## Sorted_Input_MEL.bam   8591703  100  23.6    0    36   184 0.620 4.51   NA
## Sorted_Input_Ch12.bam 10429244  100  23.7    0    35   182 0.275 4.26   NA
##                       RiBL%
## Sorted_Myc_Ch12_1.bam  12.8
## Sorted_Myc_Ch12_2.bam  10.1
## Sorted_Myc_MEL_1.bam   11.9
## Sorted_Myc_MEL_2.bam   14.0
## Sorted_Input_MEL.bam   15.9
## Sorted_Input_Ch12.bam  12.8

Quality Control (Assessing fragment length)

The prediction of fragment length is an essential part of ChIP-seq affecting peaks calling, summit identification and coverage profiles.

The use of cross-correlation or cross-coverage allows for an assessment of reads clustering by strand and so a measure of quality.

Quality Control (Assessing fragment length)

  • In ChIP-seq typically short single end reads of dsDNA.

  • dsDNA single end sequencing means
  • 5’ of fragment will be sequenced on “+” strand
  • 3’ of fragment end will be on “-” strand.

  • Although we only have partial sequence of strand, with predicted fragment length we can predicted the whole fragment
  • “+” reads should extend only in positive direction
  • “-” reads only in negative


Quality Control (Assessing fragment length)

offset


Quality Control (Assessing fragment length)

offset


Quality Control (Assessing fragment length)

offset

offset


Quality Control (Assessing fragment length)

offset


Quality Control (Assessing fragment length)

The plotCC function can be used to plot our cross-coverage profiles

The plotCC() function accepts our list of ChIPQC sample objects and a facetBy argument to allow us to group our cross-coverage profiles.

plotCC(myQC,facetBy = "Sample")


Quality Control (Assessing fragment length)

We can include additional metadata to allow us to group our plot in different ways.

We can include metadata as a data.frame where the first column is our sample names.

myMeta <- data.frame(Sample= names(myQC),
                     Tissue=c("Ch12","Ch12","MEL","MEL","MEL","Ch12"),
                     Antibody=c(rep("Myc",4),rep("Input",2)))
myMeta
##                  Sample Tissue Antibody
## 1 Sorted_Myc_Ch12_1.bam   Ch12      Myc
## 2 Sorted_Myc_Ch12_2.bam   Ch12      Myc
## 3  Sorted_Myc_MEL_1.bam    MEL      Myc
## 4  Sorted_Myc_MEL_2.bam    MEL      Myc
## 5  Sorted_Input_MEL.bam    MEL    Input
## 6 Sorted_Input_Ch12.bam   Ch12    Input

Quality Control (Assessing fragment length)

All plots in ChIPQC are in fact built in ggplot2 so we can edit and update our plot like all ggplot objects.

plotCC(myQC,facetBy = "Tissue",addMetaData = myMeta,
       colourBy="Antibody")+theme_bw()+
  ggtitle("ChIPQC results")

Quality Control - Blacklist???

offset

Quality Control - Blacklist affects many metrics.

offset

Quality Control - Standardised Standard Deviation.

Since SSD score is strongly affected by blacklisting it may be necessary to change the axis to see any differences between samples for post-blacklisting scores.

Higher post-blacklisted SSD scores reflect samples with stronger peak signal.

plotSSD(myQC)+xlim(0.2,0.4)

Quality Control - Duplication.

We should compare our duplication rate across samples to identify any sample experiencing overamplification and so potential of a lower complexity.

The flagtagcounts() function reports can report the number of duplicates and total mapped reads and so from there we can calculate our duplication rate.

myFlags <- flagtagcounts(myQC)
myFlags["DuplicateByChIPQC",]/myFlags["Mapped",]
## Sorted_Myc_Ch12_1.bam Sorted_Myc_Ch12_2.bam  Sorted_Myc_MEL_1.bam 
##            0.07203883            0.08633293            0.15987076 
##  Sorted_Myc_MEL_2.bam  Sorted_Input_MEL.bam Sorted_Input_Ch12.bam 
##            0.06253850            0.14269010            0.13873057

Quality Control - Enrichment for reads across genes.

p

Peak calling (Installing MACS2)

testing

testing

Peak calling from R

We can still run our MACS2 from the comfort of R using the system() function.

We simply need to build our MACS2 peak call command in R and pass to the system() function.

myChIP <- "Sorted_Myc_MEL_1.bam"
myControl <- "Sorted_Input_MEL.bam"

macsCommand <- paste0("macs2 callpeak -t ", myChIP,
                      " -n ", "Mel_Rep1",
                      " –-outdir ","PeakDirectory",
                      " -c ", myControl)
system(macsCommand)

Working with Peaks

In addition to the genomic coordinates of peaks, these files contain useful information on the samples, parameters and version used for peak calling at the top.

## [1] "# Command line: callpeak -t Sorted_Myc_MEL_1.bam -n Mel1 -c Sorted_Input_MEL.bam"
## [2] "# ARGUMENTS LIST:"                                                               
## [3] "# name = Mel1"                                                                   
## [4] "# format = AUTO"                                                                 
## [5] "# ChIP-seq file = ['Sorted_Myc_MEL_1.bam']"                                      
## [6] "# control file = ['Sorted_Input_MEL.bam']"                                       
## [7] "# effective genome size = 2.70e+09"                                              
## [8] "# band width = 300"

Working with Peaks - Importing peaks

We can import peak files therefore using read.delim function. Note we have set comment.char argument to # to exclude additional information on peak calling parameters stored within the MACS peak file.

macsPeaks_DF <- read.delim(macsPeaks,comment.char="#")
macsPeaks_DF[1:2,]
##    chr   start     end length abs_summit pileup X.log10.pvalue.
## 1 chr1 4785371 4785642    272    4785563  20.89        10.66553
## 2 chr1 5082993 5083247    255    5083123  33.42        12.68072
##   fold_enrichment X.log10.qvalue.        name
## 1         5.33590         7.37727 Mel1_peak_1
## 2         4.30257         9.27344 Mel1_peak_2

Working with Peaks - Peaks as GRanges

As we have seen before elements in GRanges can accessed and set using various GRanges functions. Here we can deconstruct our object back to contig names and interval ranges.

seqnames(macsPeaks_GR)
## factor-Rle of length 16757 with 21 runs
##   Lengths:   916   822  1795   582   596 ...  1089   836   954   395     7
##   Values :  chr1 chr10 chr11 chr12 chr13 ...  chr7  chr8  chr9  chrX  chrY
## Levels(21): chr1 chr10 chr11 chr12 chr13 ... chr7 chr8 chr9 chrX chrY
ranges(macsPeaks_GR)
## IRanges object with 16757 ranges and 0 metadata columns:
##               start       end     width
##           <integer> <integer> <integer>
##       [1]   4785371   4785642       272
##       [2]   5082993   5083247       255
##       [3]   7397544   7398115       572
##       [4]   7616290   7616433       144
##       [5]   8134747   8134893       147
##       ...       ...       ...       ...
##   [16753]   2657144   2657294       151
##   [16754]  90784142  90784289       148
##   [16755]  90818471  90818771       301
##   [16756]  90824549  90824905       357
##   [16757]  90825407  90825575       169

Working with Peaks - Filter peaks in blacklisted regions.

We will want to remove any peaks overlapping blacklisted regions prior to any downstream analysis. We can do this using simple overlapping with GRanges objects.

library(rtracklayer)
blkList <- import.bed("mm10-blacklist.bed")
macsPeaks_GR <- macsPeaks_GR[!macsPeaks_GR %over% blkList] 

Annotation of peaks to genes

So far we have been working with ChIP-seq peaks corresponding to transcription factor binding. Transcription factors, as implied in the name, can affect the expression of their target genes.

The target of transcription factor is hard to assertain from ChIP-seq data alone and so often we will annotate peaks to genes by a simple set of rules:-

Peaks are typically annotated to a gene if * They overlap the gene. * The gene is the closest (and within a minimum distance.)


Peak annotation

A useful package for annotation of peaks to genes is ChIPseeker.

By using pre-defined annotation in the from of a TXDB object for mouse (mm9 genome), ChIPseeker will provide us with an overview of where peaks land in the gene and distance to TSS sites.

First load the libraries we require for the next part.

library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(org.Mm.eg.db)
library(GenomeInfoDb)
library(ChIPseeker)

Peak annotation

The annotatePeak function accepts a GRanges object of the regions to annotate, a TXDB object for gene locations and a database object name to retrieve gene names from.

peakAnno <- annotatePeak(macsPeaks_GR, tssRegion=c(-500, 500), 
                         TxDb=TxDb.Mmusculus.UCSC.mm10.knownGene, 
                         annoDb="org.Mm.eg.db")
## >> preparing features information...      2018-05-11 11:51:12 
## >> identifying nearest features...        2018-05-11 11:51:12 
## >> calculating distance from peak to TSS...   2018-05-11 11:51:12 
## >> assigning genomic annotation...        2018-05-11 11:51:12 
## >> adding gene annotation...          2018-05-11 11:51:13 
## >> assigning chromosome lengths           2018-05-11 11:51:14 
## >> done...                    2018-05-11 11:51:14
class(peakAnno)
## [1] "csAnno"
## attr(,"package")
## [1] "ChIPseeker"

Peak annotation

The csAnno object contains the information on annotation of individual peaks to genes.

To extract this from the csAnno object the ChIPseeker functions as.GRanges or as.data.frame can be used to produce the respective object with peaks and their associated genes.

peakAnno_GR <- as.GRanges(peakAnno)
peakAnno_DF <- as.data.frame(peakAnno)

Visualising peak annotation

Now we have the annotated peaks from ChIPseeker we can use some of ChIPseeker’s plotting functions to display distribution of peaks in gene features. Here we use the plotAnnoBar function to plot this as a bar chart but plotAnnoPie would produce a similar plot as a pie chart.

plotAnnoBar(peakAnno)

Visualising peak annotation

ChIPseeker can also offer a succinct plot to describe the overlap between annotations.

upsetplot(peakAnno, vennpie = F)


Time for an exercise.

Link_to_exercises

Link_to_answers